PM592: Data Analysis of COVID-19 Vaccine Hesitancy and Possible Demographic and Geographic Correlations

Author

Nicole Tang

PM 592 Final Project

Link to full code on Github: https://github.com/nktang05/PM566Final/blob/main/PM592.qmd

Introduction

COVID-19 vaccine hesitancy refers to the reluctance or refusal to get vaccinated despite the availability of vaccines. Vaccination plays a crucial role in controlling the pandemic by reducing the spread of the virus, preventing severe illness, and decreasing hospitalization and death rates. However, hesitancy has been influenced by factors such as misinformation, distrust in healthcare systems or government authorities, concerns about the speed of vaccine development, and fears about potential side effects. Social, cultural, and political contexts have also shaped people’s attitudes toward vaccines.

This data set reports demographic information by county and state, including ethnicity, COVID-19 vaccine coverage (CVAC), and the social vulnerability index (SVI). To determine hesitancy levels, people were asked “Once a vaccine to prevent COVID-19 is available to you, would you…get a vaccine?” with the following options: 1) “definitely get a vaccine”; 2) “probably get a vaccine”; 3) “unsure”; 4) “probably not get a vaccine”; 5) “definitely not get a vaccine”. The data set also reports three levels of hesitancy: hesitant, hesitant or unsure, and strongly hesitant. People who responded “probably not” or “definitely not” were categorized as hesitant.
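The categorization above can be sketched in a few lines. This is a minimal Python illustration, not the project's R code and not the CDC's estimation method. Only the "hesitant" definition comes from the text; the definitions of "hesitant or unsure" and "strongly hesitant" are assumptions, and the sample responses are made up.

```python
# Response options quoted in the survey question above.
RESPONSES = [
    "definitely get a vaccine",
    "probably get a vaccine",
    "unsure",
    "probably not get a vaccine",
    "definitely not get a vaccine",
]

# "Hesitant" follows the definition given in the text.
HESITANT = {"probably not get a vaccine", "definitely not get a vaccine"}
# ASSUMPTION: the other two categories are defined like this.
HESITANT_OR_UNSURE = HESITANT | {"unsure"}
STRONGLY_HESITANT = {"definitely not get a vaccine"}


def percent_in(responses, category):
    """Return the percent of responses that fall in the category set."""
    if not responses:
        raise ValueError("no responses to summarize")
    hits = sum(1 for r in responses if r in category)
    return 100.0 * hits / len(responses)


# Hypothetical sample of ten respondents (not real survey data).
sample = (
    ["definitely get a vaccine"] * 5
    + ["probably get a vaccine"] * 2
    + ["unsure"]
    + ["probably not get a vaccine"]
    + ["definitely not get a vaccine"]
)

pct_hesitant = percent_in(sample, HESITANT)
pct_hesitant_or_unsure = percent_in(sample, HESITANT_OR_UNSURE)
pct_strongly_hesitant = percent_in(sample, STRONGLY_HESITANT)
```

Under these assumptions the three categories are nested, so "strongly hesitant" can never exceed "hesitant", which in turn can never exceed "hesitant or unsure".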

Data set origin: https://data.cdc.gov/Vaccinations/Vaccine-Hesitancy-for-COVID-19-County-and-local-es/q9mh-h2tw/about_data

I also used a data set on educational attainment (bachelor's degree or higher) by state for the year 2021. Data set origin: https://fred.stlouisfed.org/release/tables?rid=330&eid=391444&od=2021-01-01#

I also utilized a data set on COVID-19 Mortality by state for 2021. Data set origin: https://www.cdc.gov/nchs/pressroom/sosmap/covid19_mortality_final/COVID19.htm

Research Question

Are there any correlations between demographic, geographical, and social factors and the rates of vaccine hesitancy?

Identify a health-related outcome variable that you want to assess

The health outcome I am going to assess is the rate of vaccine hesitancy.

Identify 2 independent variables that may be associated with that health outcome

I consider three independent variables. The first is the social vulnerability index (SVI). SVI is a measure of how vulnerable a community is based on factors like socioeconomic status, minority status, and housing. Higher SVI scores may reflect structural barriers and lower trust in public health, potentially leading to higher vaccine hesitancy. The second is the vaccine coverage index (CVAC). CVAC measures supply and demand challenges to vaccine rollout based on healthcare accessibility barriers, sociodemographic barriers, and historic undervaccination. Higher CVAC scores indicate greater challenges in vaccine distribution, which may correlate with increased hesitancy. The third is predominant ethnicity. Cultural factors may influence vaccine hesitancy among specific groups.

Identify 1 independent variable that may be a confounder, and 1 independent variable that may be an effect modifier

The first possible confounder is the percent of adults fully vaccinated. Higher vaccination rates could reduce hesitancy by normalizing vaccination. At the same time, vaccination rates may reflect better healthcare infrastructure, which could influence predictors like SVI and CVAC. The second possible confounder is the percent of adults with a bachelor's degree or above. Higher educational attainment is often associated with greater health literacy and lower vaccine hesitancy. Communities with higher education levels might also have lower SVI scores and lower CVAC scores due to better socioeconomic conditions.

The first possible effect modifier is region. Vaccine hesitancy patterns vary across regions due to cultural, political, and healthcare access differences. The second possible effect modifier is the COVID-19 death rate per 100,000. Higher death rates may increase the perceived threat of COVID-19, modifying the relationship between predictors like CVAC or SVI and vaccine hesitancy.

Methods

Data Cleaning and Wrangling

A CSV file downloaded from the CDC website was read into a data frame. To clean the data, 280 observations containing NA values were removed. One column held latitude and longitude together as a single value of type “Point”, so I created separate latitude and longitude columns that are easier to use in later visualizations. I also added a region column (Northeast, Midwest, South, West) based on each state's Census Bureau designated region. I also used the state-level 2021 data on educational attainment (bachelor's degree or higher) and the CDC's state-level 2021 COVID-19 mortality data to test for possible correlations with education level and COVID-19 death rates. The following variables are continuous: average percent of adults fully vaccinated, average percent of adults with a bachelor's degree or higher, average COVID-19 death rate per 100,000, average SVI, and average CVAC. Region and predominant ethnicity are categorical.
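The two derived columns described above (coordinates and region) can be sketched as follows. This is a Python illustration rather than the project's R code. It assumes the point column is WKT-like text of the form `POINT (longitude latitude)` and that states are identified by two-letter postal abbreviations. Both are assumptions about the file layout, and the function names are hypothetical.

```python
import re

# ASSUMPTION: the point column looks like "POINT (-86.84 33.97)",
# with longitude first and latitude second.
_POINT = re.compile(
    r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)"
)


def parse_point(text):
    """Split a point string into a (latitude, longitude) tuple."""
    match = _POINT.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized point value: {text!r}")
    longitude, latitude = float(match.group(1)), float(match.group(2))
    return latitude, longitude


# Census Bureau regions, keyed by postal abbreviation (DC included).
_REGIONS = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}
STATE_TO_REGION = {
    state: region
    for region, states in _REGIONS.items()
    for state in states.split()
}


def region_of(state_abbreviation):
    """Return the Census region for a two-letter state abbreviation."""
    return STATE_TO_REGION[state_abbreviation.upper()]


# Hypothetical row values, for illustration only.
lat, lon = parse_point("POINT (-86.844516 33.976448)")
region = region_of("al")
```

If the file stores full state names instead of abbreviations, the lookup keys would need to change accordingly.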

Aggregate Hesitancy Rates by Ethnicity

The data set has a column for each ethnicity giving the percentage of that ethnicity in the location. I created a new categorical variable whose value is the predominant ethnicity of that location.
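A minimal Python sketch of this step, assuming "predominant" means the ethnicity with the largest share in the location (the text does not spell out the rule, and this is not the project's R code). The shares below are invented for illustration.

```python
def predominant_ethnicity(shares):
    """Return the label with the largest share.

    `shares` maps an ethnicity label to its share of the population.
    Ties resolve to the label that appears first in the mapping.
    """
    if not shares:
        raise ValueError("no ethnicity shares provided")
    return max(shares, key=shares.get)


# Hypothetical county (shares are made up, not taken from the data set).
county = {
    "Hispanic": 0.12,
    "non-Hispanic American Indian/Alaska Native": 0.01,
    "non-Hispanic Asian": 0.03,
    "non-Hispanic Black": 0.24,
    "non-Hispanic White": 0.60,
}
label = predominant_ethnicity(county)
```

One consequence of this rule is that a location is labeled by its single largest group even when that group is not a majority, which helps explain why most locations end up labeled non-Hispanic White.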

Section 1: Preliminary Analysis

Summary Statistics for Vaccine Hesitancy
Mean Hesitant (%)   SD Hesitant (%)   Min Hesitant (%)   Max Hesitant (%)
13.42069            4.785884          2.69               26.7

Preliminary analysis of hesitancy shows a mean of 13.42% with a standard deviation of 4.79%, a minimum of 2.69%, and a maximum of 26.7%.

Numeric Variable Descriptive Statistics (N = 2862)
Variable                                   Mean        SD
Percent adults fully vaccinated            39.93113    14.28988
Percent with bachelor's degree or higher   33.09312    5.015483
COVID death rate per 100,000               101.1049    35.60613
SVI                                        0.4838959   0.2880657
CVAC                                       0.46587     0.275621

Predominant Ethnicity
Category                                     Count   Percentage
non-Hispanic White                           2646    92.4528302
non-Hispanic Black                           128     4.4723969
non-Hispanic American Indian/Alaska Native   33      1.1530398
non-Hispanic Asian                           2       0.0698812
Hispanic                                     53      1.8518519

Region
Category    Count   Percentage
Midwest     1055    36.862334
Northeast   218     7.617051
South       1185    41.404612
West        404     14.116003

Preliminary analysis of the variables: Average percent of adults fully vaccinated: 39.93%

Average percent of adults with a bachelor's degree or higher: 33.09%

Average COVID-19 death rate per 100,000: 101.10

Average SVI: 0.48

Average CVAC: 0.47

Percent of areas with the predominant ethnicity being: non-Hispanic White: 92.45%

non-Hispanic Black: 4.47%

non-Hispanic American Indian/Alaska Native: 1.15%

non-Hispanic Asian: 0.07%

Hispanic: 1.85%

Percent of reporting locations in the following regions: Midwest: 36.86%

Northeast: 7.62%

South: 41.40%

West: 14.12%

Section 2: Simple X, Y relationship

Table: Regression Results for Predictors of Vaccine Hesitancy
Term                                        Coefficient    Standard Error P-Value        Lower 95% CI   Upper 95% CI

Numeric Predictors

Model: Social Vulnerability Index
(Intercept)                                 10.8751776     0.1659400      0.0000000      10.5498035     11.2005517
SVI                                         5.2604513      0.2946778      0.0000000      4.6826489      5.8382537

Model: CVAC Level of Concern
(Intercept)                                 10.1086408     0.1603391      0.0000000      9.7942490      10.4230326
CVAC                                        7.1093812      0.2962265      0.0000000      6.5285421      7.6902202

Model: % Adults Fully Vaccinated
(Intercept)                                 16.0794578     0.2602938      0.0000000      15.5690753     16.5898403
% adults fully vaccinated (as of 6/10/21)   -0.0665839     0.0061375      0.0000000      -0.0786183     -0.0545494

Model: % Bachelors Degree or Above
(Intercept)                                 34.2610010     0.4486915      0.0000000      33.3812095     35.1407925
education                                   -0.6297477     0.0134054      0.0000000      -0.6560329     -0.6034624

Model: COVID Death Rate per 100,000
(Intercept)                                 13.4671389     0.2681428      0.0000000      12.9413643     13.9929135
COVID death rate per 100,000                -0.0011289     0.0025016      0.6518146      -0.0060340     0.0037762

Categorical Predictors: Ethnicity

Model: Predominant Ethnicity (reference: non-Hispanic White)
(Intercept)                                 13.3483900     0.0913358      0.0000000      13.1692993     13.5274807
non-Hispanic Black                          2.1855162      0.4251959      0.0000003      1.3517943      3.0192382
non-Hispanic American Indian/Alaska Native  4.9482766      0.8229439      0.0000000      3.3346525      6.5619007
non-Hispanic Asian                          0.5216100      3.3234172      0.8752954      -5.9949287     7.0381487
Hispanic                                    -4.4748051     0.6517850      0.0000000      -5.7528217     -3.1967885

Categorical Predictors: Region

Model: Region (reference: Midwest)
(Intercept)                                 12.2471090     0.1331381      0.0000000      11.9860527     12.5081653
Northeast                                   -3.9495402     0.3217275      0.0000000      -4.5803816     -3.3186988
South                                       3.2073804      0.1830489      0.0000000      2.8484593      3.5663016
West                                        1.0372227      0.2530109      0.0000426      0.5411204      1.5333249

Simple regression analysis indicated that SVI was statistically significant (p < .001) with a coefficient of 5.26: for every one-unit increase in SVI, the hesitancy percentage is expected to increase by 5.26 points. CVAC was statistically significant (p < .001) with a coefficient of 7.11: for every one-unit increase in CVAC, the hesitancy percentage is expected to increase by 7.11 points. Percent of adults fully vaccinated was statistically significant (p < .001) with a coefficient of -0.07: for every one-point increase in the percent fully vaccinated, the hesitancy percentage is expected to decrease by 0.07 points. Education was statistically significant (p < .001) with a coefficient of -0.63: for every one-point increase in the percent of adults with a bachelor's degree or above, the hesitancy percentage is expected to decrease by 0.63 points. COVID-19 death rate was not statistically significant (p = 0.65), so there is no evidence of a linear association with hesitancy. For ethnicity, with non-Hispanic White as the baseline, areas that are predominantly non-Hispanic Black or non-Hispanic American Indian/Alaska Native have significantly higher hesitancy, predominantly Hispanic areas have significantly lower hesitancy, and the non-Hispanic Asian estimate is not statistically significant (p = 0.88, based on only 2 areas). With the Midwest as the baseline, the South and West have higher hesitancy and the Northeast has lower hesitancy.
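The confidence intervals in the table can be approximated from each coefficient and its standard error. The Python sketch below (not the project's R code) uses the normal critical value 1.96 as an approximation. R's `confint` uses a t critical value instead, so the last decimals differ slightly from the table.

```python
def wald_ci(estimate, std_error, critical_value=1.96):
    """Approximate 95% confidence interval: estimate +/- 1.96 * SE."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width


def excludes_zero(interval):
    """True when the interval lies entirely above or below zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Coefficients and standard errors copied from the regression table.
svi_ci = wald_ci(5.2604513, 0.2946778)
death_rate_ci = wald_ci(-0.0011289, 0.0025016)

svi_significant = excludes_zero(svi_ci)
death_rate_significant = excludes_zero(death_rate_ci)
```

The SVI interval stays well above zero, while the death-rate interval straddles zero, which matches the p-values reported for the two predictors.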

Section 2: Check for confounders

I checked whether percent of adults fully vaccinated and education percentage confounded the relationships between hesitancy and the independent variables SVI, CVAC, and predominant ethnicity. Percent of adults fully vaccinated does not appear to be a confounder, because the coefficients of the independent variables in the initial model are similar to those in the model that adds it. However, education appears to be a possible confounder. The SVI coefficient is 1.40 in the initial model and 0.60 when adjusted for education, and SVI is no longer statistically significant in the adjusted model. The CVAC coefficient is 6.38 in the initial model and 3.58 when adjusted for education, and it remains statistically significant. The model with education also has an R-squared of .50, higher than the initial model's R-squared of .20, indicating a better fit when adjusted for education.
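One way to quantify "the coefficient changed" is the percent change between the unadjusted and adjusted estimates. The Python sketch below applies the common 10% rule of thumb. That threshold is an assumption for illustration, not a criterion stated in this report, and the function names are hypothetical.

```python
def percent_change(unadjusted, adjusted):
    """Percent change in a coefficient after adjusting for a covariate."""
    return 100.0 * (unadjusted - adjusted) / unadjusted


def suggests_confounding(unadjusted, adjusted, threshold=10.0):
    """Rule of thumb (an assumption here): flag changes above 10%."""
    return abs(percent_change(unadjusted, adjusted)) > threshold


# Coefficients reported above, before and after adjusting for education.
svi_change = percent_change(1.40, 0.60)
cvac_change = percent_change(6.38, 3.58)

svi_flag = suggests_confounding(1.40, 0.60)
cvac_flag = suggests_confounding(6.38, 3.58)
```

Both coefficients shrink by far more than 10% once education is included, which is consistent with treating education as a confounder.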

Section 2: Check for effect modification

I then checked whether region and COVID-19 deaths per 100,000 were possible effect modifiers by testing interaction terms (p-values in parentheses). For region with SVI, the RegionSouth (2.36e-06) and RegionWest (9.62e-12) interactions are statistically significant. For region with CVAC, the RegionSouth (0.002982) and RegionNortheast (0.006293) interactions are statistically significant. For region with predominant ethnicity, RegionSouth with non-Hispanic American Indian/Alaska Native (0.003821), RegionWest with non-Hispanic American Indian/Alaska Native (1.87e-08), and RegionSouth with Hispanic (0.041065) are statistically significant. The interaction of COVID-19 death rate with SVI is statistically significant (< 2e-16), as is its interaction with CVAC (< 2e-16). For COVID-19 death rate with ethnicity, the non-Hispanic American Indian/Alaska Native (0.000126) and non-Hispanic Asian (0.003208) interactions are statistically significant. Based on these p-values, I concluded that region and COVID-19 death rate were both effect modifiers.

Section 3 Table 3: Final Model


Call:
lm(formula = `Estimated hesitant` ~ +(`CVAC level of concern for vaccination rollout` * 
    Region) + (education), data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.4084  -1.8940  -0.1932   1.8329  14.9730 

Coefficients:
                                                                 Estimate    Std. Error  t value   Pr(>|t|)
(Intercept)                                                      30.359353   0.493490    61.520    < 2e-16 ***
`CVAC level of concern for vaccination rollout`                  2.699532    0.408555    6.608     4.65e-11 ***
RegionNortheast                                                  -0.007013   0.428486    -0.016    0.987
RegionSouth                                                      0.121749    0.296376    0.411     0.681
RegionWest                                                       7.845878    0.449591    17.451    < 2e-16 ***
education                                                        -0.577217   0.013917    -41.477   < 2e-16 ***
`CVAC level of concern for vaccination rollout`:RegionNortheast  0.041016    1.685704    0.024     0.981
`CVAC level of concern for vaccination rollout`:RegionSouth      2.177955    0.549589    3.963     7.59e-05 ***
`CVAC level of concern for vaccination rollout`:RegionWest       -11.516726  0.864737    -13.318   < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.218 on 2853 degrees of freedom
Multiple R-squared:  0.5492,    Adjusted R-squared:  0.548 
F-statistic: 434.5 on 8 and 2853 DF,  p-value: < 2.2e-16

To determine the best model, I started with a baseline model that included region and COVID-19 death rate as effect modifiers and education as a confounder (R-squared = .6772). I then used stepAIC to make the model more parsimonious. stepAIC returned Estimated hesitant ~ + (CVAC level of concern for vaccination rollout * covidDeathRateper100,000 * Region) + (Region * Predominant_Ethnicity) + (covidDeathRateper100,000 * education), with an R-squared of .6771. Although this model was more parsimonious than the baseline and had almost the same R-squared, I wanted to simplify it further.
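Stepwise selection by AIC trades goodness of fit against the number of parameters. The Python sketch below shows that trade-off using the form of AIC for a linear model fit by least squares, n·log(RSS/n) + 2·(number of parameters), which to my understanding is the quantity stepAIC compares up to an additive constant. The RSS values and parameter counts are hypothetical, not results from this project.

```python
import math


def aic_linear(rss, n_obs, n_params, penalty=2.0):
    """AIC for a least-squares linear model, up to an additive constant."""
    if rss <= 0 or n_obs <= 0:
        raise ValueError("rss and n_obs must be positive")
    return n_obs * math.log(rss / n_obs) + penalty * n_params


N_OBS = 2862  # number of observations reported in the descriptive table

# HYPOTHETICAL fits: the larger model reduces RSS only slightly.
aic_large = aic_linear(rss=21000.0, n_obs=N_OBS, n_params=40)
aic_small = aic_linear(rss=21010.0, n_obs=N_OBS, n_params=30)

prefer_small = aic_small < aic_large
```

Because the extra ten parameters buy almost no reduction in residual error in this made-up example, the penalty term dominates and the smaller model has the lower AIC.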

The next model is Estimated hesitant ~ + (CVAC level of concern for vaccination rollout * covidDeathRateper100,000 * Region) + (education), with an R-squared of .6314. This model is more parsimonious while keeping the R-squared as large as possible.

This model formula with coefficients

Estimated_hesitant = 25.227 + 0.208⋅SVI − 1.182⋅CVAC + 0.0216⋅DeathRate + 4.825⋅RegionNortheast − 5.031⋅RegionSouth + 28.553⋅RegionWest − 0.516⋅Education + 0.0454⋅(CVAC⋅DeathRate) + 8.376⋅(CVAC⋅RegionNortheast) + 18.550⋅(CVAC⋅RegionSouth) − 33.204⋅(CVAC⋅RegionWest) − 0.0451⋅(DeathRate⋅RegionNortheast) + 0.0728⋅(DeathRate⋅RegionSouth) − 0.2356⋅(DeathRate⋅RegionWest) − 0.0975⋅(CVAC⋅DeathRate⋅RegionNortheast) − 0.1867⋅(CVAC⋅DeathRate⋅RegionSouth) + 0.2515⋅(CVAC⋅DeathRate⋅RegionWest)

However, I found this model overcomplicated when both effect modifiers were used together, so I kept working toward a simpler model. The final model is Estimated hesitant ~ +(CVAC level of concern for vaccination rollout * Region) + (education). This model is much simpler and more parsimonious. However, it loses some explanatory power: its R-squared is .5492.
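The fit statistics in the final model's summary output can be reproduced from its R-squared, sample size, and number of predictors. This Python sketch (not the project's R code) uses the standard formulas for adjusted R-squared and the overall F-statistic. The inputs are the rounded values printed in the summary, so the results agree only to the displayed precision.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the predictor count."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic for a linear model, computed from R-squared."""
    explained = r2 / n_predictors
    unexplained = (1.0 - r2) / (n_obs - n_predictors - 1)
    return explained / unexplained


# Values from the summary output: 8 model terms, 2853 residual df.
N_PREDICTORS = 8
N_OBS = 2853 + N_PREDICTORS + 1  # residual df + predictors + intercept
R2 = 0.5492

adj = adjusted_r2(R2, N_OBS, N_PREDICTORS)
f_value = f_statistic(R2, N_OBS, N_PREDICTORS)
```

With 2,862 observations and only 8 terms, the adjustment is tiny, which is why the adjusted R-squared (about 0.548) is nearly identical to the unadjusted value.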

This model formula with coefficients

Estimated hesitant = 30.359 + 2.700⋅CVAC − 0.007⋅RegionNortheast + 0.122⋅RegionSouth + 7.846⋅RegionWest − 0.577⋅Education + 0.041⋅(CVAC⋅RegionNortheast) + 2.178⋅(CVAC⋅RegionSouth) − 11.517⋅(CVAC⋅RegionWest)

The intercept (<2e-16), CVAC (4.65e-11), education (<2e-16), RegionWest (<2e-16), CVAC×RegionSouth (7.59e-05), and CVAC×RegionWest (<2e-16) are all statistically significant. The Midwest is the baseline region. When CVAC and education are both 0 in the Midwest, the expected hesitancy is 30.36%. In the Midwest, each one-unit increase in CVAC is associated with a 2.70-point increase in hesitancy, holding education constant. Each one-point increase in the percent of adults with a bachelor's degree or above is associated with a 0.58-point decrease in hesitancy. At a CVAC of 0, hesitancy in the West is expected to be 7.85 points higher than in the Midwest. The interaction terms change the CVAC slope by region: relative to the Midwest, the slope is 2.18 points higher in the South and 11.52 points lower in the West, while the Northeast slope is essentially the same (0.04 higher, not significant). The positive coefficients are CVAC, RegionSouth, RegionWest, CVAC×RegionNortheast, and CVAC×RegionSouth. The negative coefficients are RegionNortheast, education, and CVAC×RegionWest.
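Because the model includes a CVAC-by-region interaction, the effect of CVAC is a different slope in each region. The Python sketch below (not the project's R code) plugs the coefficients from the summary output into the model equation. The input values passed to `predict_hesitancy` are hypothetical.

```python
# Coefficients copied from the final model's summary output.
# The Midwest is the reference region, so its offsets are zero.
INTERCEPT = 30.359353
CVAC_SLOPE = 2.699532
EDUCATION_SLOPE = -0.577217
REGION_OFFSET = {
    "Midwest": 0.0,
    "Northeast": -0.007013,
    "South": 0.121749,
    "West": 7.845878,
}
CVAC_BY_REGION = {
    "Midwest": 0.0,
    "Northeast": 0.041016,
    "South": 2.177955,
    "West": -11.516726,
}


def cvac_slope(region):
    """CVAC slope in a region: main effect plus the interaction term."""
    return CVAC_SLOPE + CVAC_BY_REGION[region]


def predict_hesitancy(cvac, education, region):
    """Predicted percent hesitant from the final model equation."""
    return (
        INTERCEPT
        + REGION_OFFSET[region]
        + cvac_slope(region) * cvac
        + EDUCATION_SLOPE * education
    )


slopes = {region: round(cvac_slope(region), 3) for region in REGION_OFFSET}

# Hypothetical inputs: CVAC of 0.5 and 33% with a bachelor's degree.
example = predict_hesitancy(cvac=0.5, education=33.0, region="Midwest")
```

Adding the interaction terms to the main effect gives a CVAC slope of about 2.70 in the Midwest, 2.74 in the Northeast, 4.88 in the South, and -8.82 in the West, so the direction of the CVAC association reverses in the West.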

Goodness of Fit

To check the residuals, I used autoplot. The residuals-versus-fitted plot shows a slight curve, indicating possible non-linearity, but the relationship looks mostly linear. The tails deviate on the normal Q-Q plot, indicating slight non-normality in the residuals. The homoscedasticity check shows a slight upward trend, indicating some non-constant variance.

Conclusion

Overall, the model could be a stronger predictor, but that would mean giving up some parsimony. In the end, my model had a decent R-squared of .5492. This means the model provides a reasonable explanation for vaccine hesitancy, explaining about 55% of its variability.

Interventions

To address vaccine hesitancy, interventions should focus on high-hesitancy states and regions, using local messaging, education, and community engagement. More vulnerable populations should be prioritized. Accessibility to clinics could be improved, not just economically but also physically, through transportation. Combating misinformation is essential to counter vaccine myths effectively. When working with diverse populations, interventions should involve collaboration with community leaders and culturally tailored messaging to build trust and address the unique barriers each group faces. Culturally relevant interventions, like Es Tiempo, a campaign that raises awareness of cervical cancer prevention among Latinas, have proven to be successful. More data collection and evaluation will help sustain vaccination rates across all communities.

Additional Tables and Insight

Figure 3: Average Hesitancy Rates by Social Vulnerability Index

Very high vulnerability has the highest median estimated hesitancy (16.76%). Very low vulnerability has the lowest median hesitancy (10.55%). This is interesting because one might expect more vulnerable areas to be less hesitant.

Figure 4: Average Hesitancy Rates by CVAC

Very high concern has the highest median estimated hesitancy (16.80%). Low concern has the lowest median hesitancy (10.18%). It makes sense that areas with very high concern about vaccine rollout challenges could also have high levels of hesitancy. For example, misinformation could both raise hesitancy and create challenges for vaccine rollout.

Figure 5: Average Hesitancy Rates by Ethnicity

This graph displays the relationship between the percentage of an ethnicity in a population and the estimated hesitancy levels. Some groups, like non-Hispanic Asians, appear to have lower overall hesitancy, while groups such as non-Hispanic Black and non-Hispanic American Indian/Alaska Native show a wider spread and higher average hesitancy.